[CodeLayout] Faster basic block reordering, ext-tsp #68617


Merged: 5 commits into llvm:main on Oct 25, 2023

Conversation

@spupyrev spupyrev commented Oct 9, 2023

Aggressive inlining might produce huge functions with >10K basic blocks.
Since BFI treats all blocks and jumps as "hot", each having a non-negative
(but perhaps small) weight, the current implementation can be slow, taking
minutes to produce a layout. This change introduces a few modifications that
significantly (up to 50x on some instances) speed up the computation. Some
notable changes:

  • reduced the maximum chain size to 512 (from the prior 4096);
  • introduced a MaxMergeDensityRatio param to avoid merging chains with
    very different densities (a sketch of the check follows this list);
  • dropped a couple of params that seem unnecessary.
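
For illustration, here is a minimal standalone sketch of the gate that the new
MaxMergeDensityRatio param controls (it mirrors the hunk added to
CodeLayout.cpp in the patch below; the free-standing function and its name are
just for this example):

  #include <algorithm>
  #include <cassert>

  // Sketch of the check behind -ext-tsp-max-merge-density-ratio: two chains
  // are considered for merging only if their execution densities are within
  // the given ratio of each other (the patch sets the default to 1000).
  static bool densitiesAllowMerging(double DensityA, double DensityB,
                                    double MaxMergeDensityRatio = 1000.0) {
    assert(DensityA > 0.0 && DensityB > 0.0 &&
           "incorrectly computed chain densities");
    const double MinDensity = std::min(DensityA, DensityB);
    const double MaxDensity = std::max(DensityA, DensityB);
    return MaxDensity / MinDensity <= MaxMergeDensityRatio;
  }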

Looking at some "offline" metrics (e.g., the number of created
fall-throughs), there shouldn't be any problems; in fact, I do see some
metrics go up. But it might be hard/impossible to measure a perf difference
for such small changes. I did test the performance of a clang-14 binary and
did not record any perf or i-cache-related differences.

My 6 benchmarks, with the ext-tsp running time (lower is better) and the
"tsp-score" (higher is better).
Before:

  • benchmark 1:
    num functions: 13,047
    reordering running time is 2.4 seconds
    score: 125503458 (128.3102%)
  • benchmark 2:
    num functions: 16,438
    reordering running time is 3.4 seconds
    score: 12613997277 (129.7495%)
  • benchmark 3:
    num functions: 12,359
    reordering running time is 1.9 seconds
    score: 1315881613 (105.8991%)
  • benchmark 4:
    num functions: 96,588
    reordering running time is 7.3 seconds
    score: 89513906284 (100.3413%)
  • benchmark 5:
    num functions: 1
    reordering running time is 372 seconds
    score: 21292505965077 (99.9979%)
  • benchmark 6:
    num functions: 71,155
    reordering running time is 314 seconds
    score: 29795381626270671437824 (102.7519%)

After:

  • benchmark 1:
    reordering running time is 2.2 seconds
    score: 125510418 (128.3130%)

  • benchmark 2:
    reordering running time is 2.6 seconds
    score: 12614502162 (129.7525%)

  • benchmark 3:
    reordering running time is 1.6 seconds
    score: 1315938168 (105.9024%)

  • benchmark 4:
    reordering running time is 4.9 seconds
    score: 89518095837 (100.3454%)

  • benchmark 5:
    reordering running time is 4.8 seconds
    score: 21292295939119 (99.9971%)

  • benchmark 6:
    reordering running time is 104 seconds
    score: 29796710925310302879744 (102.7565%)

@spupyrev spupyrev marked this pull request as ready for review October 9, 2023 20:00
llvmbot commented Oct 9, 2023

@llvm/pr-subscribers-backend-x86

@llvm/pr-subscribers-llvm-transforms

Changes

Aggressive inlining might produce huge functions with >10K basic blocks.
Since BFI treats all blocks and jumps as "hot", each having a non-negative
(but perhaps small) weight, the current implementation can be slow, taking
minutes to produce a layout. This change introduces a few modifications that
significantly (up to 50x on some instances) speed up the computation. Some
notable changes:

  • reduced the maximum chain size to 512 (from the prior 4096);
  • introduced MaxMergeDensityRatio param to avoid merging chains with
    very different densities;
  • dropped a couple of params that seem unnecessary.

Looking at some "offline" metrics (e.g., the number of created
fall-throughs), there shouldn't be any problems; in fact, I do see some
metrics go up. But it might be hard/impossible to measure a perf difference
for such small changes. I did test the performance of a clang-14 binary and
did not record any perf or i-cache-related differences.

My 5 benchmarks, with the ext-tsp running time (lower is better) and the
"tsp-score" (higher is better).
Before:

  • benchmark 1:
    reordering running time is 2486 milliseconds
    score: 125503458 (128.3102%)
  • benchmark 2:
    reordering running time is 3443 milliseconds
    score: 12613997277 (129.7495%)
  • benchmark 2:
    reordering running time is 1978 milliseconds
    score: 1315881613 (105.8991%)
  • benchmark 4:
    reordering running time is 7364 milliseconds
    score: 89513906284 (100.3413%)
  • benchmark 5:
    reordering running time is 372605 milliseconds
    score: 21292505965077 (99.9979%)

After:

  • benchmark 1:
    reordering running time is 2498 milliseconds
    score: 125510418 (128.3173%)

  • benchmark 2:
    reordering running time is 3201 milliseconds
    score: 12614502162 (129.7547%)

  • benchmark 3:
    reordering running time is 2137 milliseconds
    score: 1315938168 (105.9036%)

  • benchmark 4:
    reordering running time is 6242 milliseconds
    score: 89518095837 (100.3460%)

  • benchmark 5:
    reordering running time is 5819 milliseconds
    score: 21292295939119 (99.9969%)


Patch is 22.95 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/68617.diff

3 Files Affected:

  • (modified) llvm/lib/Transforms/Utils/CodeLayout.cpp (+143-100)
  • (modified) llvm/test/CodeGen/X86/code_placement_ext_tsp.ll (+2-11)
  • (modified) llvm/test/CodeGen/X86/code_placement_ext_tsp_large.ll (+8-8)
diff --git a/llvm/lib/Transforms/Utils/CodeLayout.cpp b/llvm/lib/Transforms/Utils/CodeLayout.cpp
index 057a5e86c04aca1..a7e1393d693e9ef 100644
--- a/llvm/lib/Transforms/Utils/CodeLayout.cpp
+++ b/llvm/lib/Transforms/Utils/CodeLayout.cpp
@@ -99,22 +99,15 @@ static cl::opt<unsigned> BackwardDistance(
     cl::desc("The maximum distance (in bytes) of a backward jump for ExtTSP"));
 
 // The maximum size of a chain created by the algorithm. The size is bounded
-// so that the algorithm can efficiently process extremely large instance.
+// so that the algorithm can efficiently process extremely large instances.
 static cl::opt<unsigned>
-    MaxChainSize("ext-tsp-max-chain-size", cl::ReallyHidden, cl::init(4096),
-                 cl::desc("The maximum size of a chain to create."));
+    MaxChainSize("ext-tsp-max-chain-size", cl::ReallyHidden, cl::init(512),
+                 cl::desc("The maximum size of a chain to create"));
 
-// The maximum size of a chain for splitting. Larger values of the threshold
-// may yield better quality at the cost of worsen run-time.
-static cl::opt<unsigned> ChainSplitThreshold(
-    "ext-tsp-chain-split-threshold", cl::ReallyHidden, cl::init(128),
-    cl::desc("The maximum size of a chain to apply splitting"));
-
-// The option enables splitting (large) chains along in-coming and out-going
-// jumps. This typically results in a better quality.
-static cl::opt<bool> EnableChainSplitAlongJumps(
-    "ext-tsp-enable-chain-split-along-jumps", cl::ReallyHidden, cl::init(true),
-    cl::desc("The maximum size of a chain to apply splitting"));
+// The maximum ratio between densities of two chains for merging.
+static cl::opt<double> MaxMergeDensityRatio(
+    "ext-tsp-max-merge-density-ratio", cl::ReallyHidden, cl::init(1000),
+    cl::desc("The maximum ratio between densities of two chains for merging"));
 
 // Algorithm-specific options for CDS.
 static cl::opt<unsigned> CacheEntries("cds-cache-entries", cl::ReallyHidden,
@@ -217,11 +210,13 @@ struct NodeT {
   NodeT &operator=(const NodeT &) = delete;
   NodeT &operator=(NodeT &&) = default;
 
-  explicit NodeT(size_t Index, uint64_t Size, uint64_t EC)
-      : Index(Index), Size(Size), ExecutionCount(EC) {}
+  explicit NodeT(size_t Index, uint64_t Size, uint64_t Count)
+      : Index(Index), Size(Size), ExecutionCount(Count) {}
 
   bool isEntry() const { return Index == 0; }
 
+  // Check if Other is a successor of the node.
+  bool isSuccessor(const NodeT *Other) const;
   // The total execution count of outgoing jumps.
   uint64_t outCount() const;
 
@@ -442,6 +437,14 @@ struct ChainEdge {
   bool CacheValidBackward{false};
 };
 
+bool NodeT::isSuccessor(const NodeT *Other) const {
+  for (JumpT *Jump : OutJumps) {
+    if (Jump->Target == Other)
+      return true;
+  }
+  return false;
+}
+
 uint64_t NodeT::outCount() const {
   uint64_t Count = 0;
   for (JumpT *Jump : OutJumps)
@@ -475,14 +478,14 @@ void ChainT::mergeEdges(ChainT *Other) {
   }
 }
 
-using NodeIter = std::vector<NodeT *>::const_iterator;
+/// A wrapper around three concatenated vectors (chains) of objects; it is used
+/// to avoid extra instantiation of the vectors.
+template <typename ObjType> struct MergedVector {
+  using ObjIter = typename std::vector<ObjType *>::const_iterator;
 
-/// A wrapper around three chains of nodes; it is used to avoid extra
-/// instantiation of the vectors.
-struct MergedChain {
-  MergedChain(NodeIter Begin1, NodeIter End1, NodeIter Begin2 = NodeIter(),
-              NodeIter End2 = NodeIter(), NodeIter Begin3 = NodeIter(),
-              NodeIter End3 = NodeIter())
+  MergedVector(ObjIter Begin1, ObjIter End1, ObjIter Begin2 = ObjIter(),
+               ObjIter End2 = ObjIter(), ObjIter Begin3 = ObjIter(),
+               ObjIter End3 = ObjIter())
       : Begin1(Begin1), End1(End1), Begin2(Begin2), End2(End2), Begin3(Begin3),
         End3(End3) {}
 
@@ -507,13 +510,26 @@ struct MergedChain {
 
   const NodeT *getFirstNode() const { return *Begin1; }
 
+  bool empty() const { return Begin1 == End1; }
+
+  void append(ObjIter Begin, ObjIter End) {
+    if (Begin2 == End2) {
+      Begin2 = Begin;
+      End2 = End;
+      return;
+    }
+    assert(Begin3 == End3 && "cannot extend MergedVector");
+    Begin3 = Begin;
+    End3 = End;
+  }
+
 private:
-  NodeIter Begin1;
-  NodeIter End1;
-  NodeIter Begin2;
-  NodeIter End2;
-  NodeIter Begin3;
-  NodeIter End3;
+  ObjIter Begin1;
+  ObjIter End1;
+  ObjIter Begin2;
+  ObjIter End2;
+  ObjIter Begin3;
+  ObjIter End3;
 };
 
 /// Merge two chains of nodes respecting a given 'type' and 'offset'.
@@ -521,29 +537,29 @@ struct MergedChain {
 /// If MergeType == 0, then the result is a concatenation of two chains.
 /// Otherwise, the first chain is cut into two sub-chains at the offset,
 /// and merged using all possible ways of concatenating three chains.
-MergedChain mergeNodes(const std::vector<NodeT *> &X,
-                       const std::vector<NodeT *> &Y, size_t MergeOffset,
-                       MergeTypeT MergeType) {
+MergedVector<NodeT> mergeNodes(const std::vector<NodeT *> &X,
+                               const std::vector<NodeT *> &Y,
+                               size_t MergeOffset, MergeTypeT MergeType) {
   // Split the first chain, X, into X1 and X2.
-  NodeIter BeginX1 = X.begin();
-  NodeIter EndX1 = X.begin() + MergeOffset;
-  NodeIter BeginX2 = X.begin() + MergeOffset;
-  NodeIter EndX2 = X.end();
-  NodeIter BeginY = Y.begin();
-  NodeIter EndY = Y.end();
+  MergedVector<NodeT>::ObjIter BeginX1 = X.begin();
+  MergedVector<NodeT>::ObjIter EndX1 = X.begin() + MergeOffset;
+  MergedVector<NodeT>::ObjIter BeginX2 = X.begin() + MergeOffset;
+  MergedVector<NodeT>::ObjIter EndX2 = X.end();
+  MergedVector<NodeT>::ObjIter BeginY = Y.begin();
+  MergedVector<NodeT>::ObjIter EndY = Y.end();
 
   // Construct a new chain from the three existing ones.
   switch (MergeType) {
   case MergeTypeT::X_Y:
-    return MergedChain(BeginX1, EndX2, BeginY, EndY);
+    return MergedVector<NodeT>(BeginX1, EndX2, BeginY, EndY);
   case MergeTypeT::Y_X:
-    return MergedChain(BeginY, EndY, BeginX1, EndX2);
+    return MergedVector<NodeT>(BeginY, EndY, BeginX1, EndX2);
   case MergeTypeT::X1_Y_X2:
-    return MergedChain(BeginX1, EndX1, BeginY, EndY, BeginX2, EndX2);
+    return MergedVector<NodeT>(BeginX1, EndX1, BeginY, EndY, BeginX2, EndX2);
   case MergeTypeT::Y_X2_X1:
-    return MergedChain(BeginY, EndY, BeginX2, EndX2, BeginX1, EndX1);
+    return MergedVector<NodeT>(BeginY, EndY, BeginX2, EndX2, BeginX1, EndX1);
   case MergeTypeT::X2_X1_Y:
-    return MergedChain(BeginX2, EndX2, BeginX1, EndX1, BeginY, EndY);
+    return MergedVector<NodeT>(BeginX2, EndX2, BeginX1, EndX1, BeginY, EndY);
   }
   llvm_unreachable("unexpected chain merge type");
 }
@@ -618,6 +634,10 @@ class ExtTSPImpl {
     AllChains.reserve(NumNodes);
     HotChains.reserve(NumNodes);
     for (NodeT &Node : AllNodes) {
+      // Adjust execution counts.
+      Node.ExecutionCount = std::max(Node.ExecutionCount, Node.inCount());
+      Node.ExecutionCount = std::max(Node.ExecutionCount, Node.outCount());
+      // Create a chain.
       AllChains.emplace_back(Node.Index, &Node);
       Node.CurChain = &AllChains.back();
       if (Node.ExecutionCount > 0)
@@ -630,13 +650,13 @@ class ExtTSPImpl {
       for (JumpT *Jump : PredNode.OutJumps) {
         NodeT *SuccNode = Jump->Target;
         ChainEdge *CurEdge = PredNode.CurChain->getEdge(SuccNode->CurChain);
-        // this edge is already present in the graph.
+        // This edge is already present in the graph.
         if (CurEdge != nullptr) {
           assert(SuccNode->CurChain->getEdge(PredNode.CurChain) != nullptr);
           CurEdge->appendJump(Jump);
           continue;
         }
-        // this is a new edge.
+        // This is a new edge.
         AllEdges.emplace_back(Jump);
         PredNode.CurChain->addEdge(SuccNode->CurChain, &AllEdges.back());
         SuccNode->CurChain->addEdge(PredNode.CurChain, &AllEdges.back());
@@ -649,7 +669,7 @@ class ExtTSPImpl {
   /// to B are from A. Such nodes should be adjacent in the optimal ordering;
   /// the method finds and merges such pairs of nodes.
   void mergeForcedPairs() {
-    // Find fallthroughs based on edge weights.
+    // Find forced pairs of blocks.
     for (NodeT &Node : AllNodes) {
       if (SuccNodes[Node.Index].size() == 1 &&
           PredNodes[SuccNodes[Node.Index][0]].size() == 1 &&
@@ -699,28 +719,44 @@ class ExtTSPImpl {
     /// Deterministically compare pairs of chains.
     auto compareChainPairs = [](const ChainT *A1, const ChainT *B1,
                                 const ChainT *A2, const ChainT *B2) {
-      if (A1 != A2)
-        return A1->Id < A2->Id;
-      return B1->Id < B2->Id;
+      return std::make_tuple(A1->Id, B1->Id) < std::make_tuple(A2->Id, B2->Id);
     };
 
+    double PrevScore = 1e9;
     while (HotChains.size() > 1) {
       ChainT *BestChainPred = nullptr;
       ChainT *BestChainSucc = nullptr;
       MergeGainT BestGain;
       // Iterate over all pairs of chains.
       for (ChainT *ChainPred : HotChains) {
+        // Since the score of merging doesn't increase, we can stop early when
+        // the newly found merge is as good as the previous one.
+        if (BestGain.score() == PrevScore)
+          break;
         // Get candidates for merging with the current chain.
         for (const auto &[ChainSucc, Edge] : ChainPred->Edges) {
           // Ignore loop edges.
-          if (ChainPred == ChainSucc)
+          if (Edge->isSelfEdge())
             continue;
-
           // Stop early if the combined chain violates the maximum allowed size.
           if (ChainPred->numBlocks() + ChainSucc->numBlocks() >= MaxChainSize)
             continue;
+          // Don't merge the chains if they have vastly different densities.
+          // We stop early if the ratio between the densities exceeds
+          // MaxMergeDensityRatio. Smaller values of the option result in
+          // fewer merges (hence, more chains), which in turn typically yields
+          // smaller size of the hot code section.
+          double minDensity =
+              std::min(ChainPred->density(), ChainSucc->density());
+          double maxDensity =
+              std::max(ChainPred->density(), ChainSucc->density());
+          assert(minDensity > 0.0 && maxDensity > 0.0 &&
+                 "incorrectly computed chain densities");
+          const double Ratio = maxDensity / minDensity;
+          if (Ratio > MaxMergeDensityRatio)
+            continue;
 
-          // Compute the gain of merging the two chains.
+          // Compute the gain of merging the two chains
           MergeGainT CurGain = getBestMergeGain(ChainPred, ChainSucc, Edge);
           if (CurGain.score() <= EPS)
             continue;
@@ -732,6 +768,9 @@ class ExtTSPImpl {
             BestGain = CurGain;
             BestChainPred = ChainPred;
             BestChainSucc = ChainSucc;
+            // Stop early when the merge is as good as the previous one.
+            if (BestGain.score() == PrevScore)
+              break;
           }
         }
       }
@@ -741,6 +780,7 @@ class ExtTSPImpl {
         break;
 
       // Merge the best pair of chains.
+      PrevScore = BestGain.score();
       mergeChains(BestChainPred, BestChainSucc, BestGain.mergeOffset(),
                   BestGain.mergeType());
     }
@@ -769,24 +809,25 @@ class ExtTSPImpl {
   }
 
   /// Compute the Ext-TSP score for a given node order and a list of jumps.
-  double extTSPScore(const MergedChain &MergedBlocks,
-                     const std::vector<JumpT *> &Jumps) const {
-    if (Jumps.empty())
+  double extTSPScore(const MergedVector<NodeT> &Nodes,
+                     const MergedVector<JumpT> &Jumps) const {
+    if (Jumps.empty() || Nodes.empty())
       return 0.0;
+
     uint64_t CurAddr = 0;
-    MergedBlocks.forEach([&](const NodeT *Node) {
+    Nodes.forEach([&](const NodeT *Node) {
       Node->EstimatedAddr = CurAddr;
       CurAddr += Node->Size;
     });
 
     double Score = 0;
-    for (JumpT *Jump : Jumps) {
+    Jumps.forEach([&](const JumpT *Jump) {
       const NodeT *SrcBlock = Jump->Source;
       const NodeT *DstBlock = Jump->Target;
       Score += ::extTSPScore(SrcBlock->EstimatedAddr, SrcBlock->Size,
                              DstBlock->EstimatedAddr, Jump->ExecutionCount,
                              Jump->IsConditional);
-    }
+    });
     return Score;
   }
 
@@ -798,17 +839,15 @@ class ExtTSPImpl {
   /// element being the corresponding merging type.
   MergeGainT getBestMergeGain(ChainT *ChainPred, ChainT *ChainSucc,
                               ChainEdge *Edge) const {
-    if (Edge->hasCachedMergeGain(ChainPred, ChainSucc)) {
+    if (Edge->hasCachedMergeGain(ChainPred, ChainSucc))
       return Edge->getCachedMergeGain(ChainPred, ChainSucc);
-    }
 
     // Precompute jumps between ChainPred and ChainSucc.
-    auto Jumps = Edge->jumps();
-    ChainEdge *EdgePP = ChainPred->getEdge(ChainPred);
-    if (EdgePP != nullptr) {
-      Jumps.insert(Jumps.end(), EdgePP->jumps().begin(), EdgePP->jumps().end());
-    }
+    MergedVector<JumpT> Jumps(Edge->jumps().begin(), Edge->jumps().end());
     assert(!Jumps.empty() && "trying to merge chains w/o jumps");
+    ChainEdge *EdgePP = ChainPred->getEdge(ChainPred);
+    if (EdgePP != nullptr)
+      Jumps.append(EdgePP->jumps().begin(), EdgePP->jumps().end());
 
     // This object holds the best chosen gain of merging two chains.
     MergeGainT Gain = MergeGainT();
@@ -836,35 +875,36 @@ class ExtTSPImpl {
     Gain.updateIfLessThan(
         computeMergeGain(ChainPred, ChainSucc, Jumps, 0, MergeTypeT::X_Y));
 
-    if (EnableChainSplitAlongJumps) {
-      // Attach (a part of) ChainPred before the first node of ChainSucc.
-      for (JumpT *Jump : ChainSucc->Nodes.front()->InJumps) {
-        const NodeT *SrcBlock = Jump->Source;
-        if (SrcBlock->CurChain != ChainPred)
-          continue;
-        size_t Offset = SrcBlock->CurIndex + 1;
-        tryChainMerging(Offset, {MergeTypeT::X1_Y_X2, MergeTypeT::X2_X1_Y});
-      }
+    // Attach (a part of) ChainPred before the first node of ChainSucc.
+    for (JumpT *Jump : ChainSucc->Nodes.front()->InJumps) {
+      const NodeT *SrcBlock = Jump->Source;
+      if (SrcBlock->CurChain != ChainPred)
+        continue;
+      size_t Offset = SrcBlock->CurIndex + 1;
+      tryChainMerging(Offset, {MergeTypeT::X1_Y_X2, MergeTypeT::X2_X1_Y});
+    }
 
-      // Attach (a part of) ChainPred after the last node of ChainSucc.
-      for (JumpT *Jump : ChainSucc->Nodes.back()->OutJumps) {
-        const NodeT *DstBlock = Jump->Target;
-        if (DstBlock->CurChain != ChainPred)
-          continue;
-        size_t Offset = DstBlock->CurIndex;
-        tryChainMerging(Offset, {MergeTypeT::X1_Y_X2, MergeTypeT::Y_X2_X1});
-      }
+    // Attach (a part of) ChainPred after the last node of ChainSucc.
+    for (JumpT *Jump : ChainSucc->Nodes.back()->OutJumps) {
+      const NodeT *DstBlock = Jump->Target;
+      if (DstBlock->CurChain != ChainPred)
+        continue;
+      size_t Offset = DstBlock->CurIndex;
+      tryChainMerging(Offset, {MergeTypeT::X1_Y_X2, MergeTypeT::Y_X2_X1});
     }
 
     // Try to break ChainPred in various ways and concatenate with ChainSucc.
-    if (ChainPred->Nodes.size() <= ChainSplitThreshold) {
-      for (size_t Offset = 1; Offset < ChainPred->Nodes.size(); Offset++) {
-        // Try to split the chain in different ways. In practice, applying
-        // X2_Y_X1 merging is almost never provides benefits; thus, we exclude
-        // it from consideration to reduce the search space.
-        tryChainMerging(Offset, {MergeTypeT::X1_Y_X2, MergeTypeT::Y_X2_X1,
-                                 MergeTypeT::X2_X1_Y});
-      }
+    // In practice, applying X2_Y_X1 merging is almost never provides benefits;
+    // thus, we exclude it from consideration to reduce the search space.
+    for (size_t Offset = 1; Offset < ChainPred->Nodes.size(); Offset++) {
+      // Do not split the chain along a jump.
+      const NodeT *BB = ChainPred->Nodes[Offset - 1];
+      const NodeT *BB2 = ChainPred->Nodes[Offset];
+      if (BB->isSuccessor(BB2))
+        continue;
+
+      tryChainMerging(Offset, {MergeTypeT::X1_Y_X2, MergeTypeT::Y_X2_X1,
+                               MergeTypeT::X2_X1_Y});
     }
     Edge->setCachedMergeGain(ChainPred, ChainSucc, Gain);
     return Gain;
@@ -875,19 +915,20 @@ class ExtTSPImpl {
   ///
   /// The two chains are not modified in the method.
   MergeGainT computeMergeGain(const ChainT *ChainPred, const ChainT *ChainSucc,
-                              const std::vector<JumpT *> &Jumps,
+                              const MergedVector<JumpT> &Jumps,
                               size_t MergeOffset, MergeTypeT MergeType) const {
-    auto MergedBlocks =
+    MergedVector<NodeT> MergedNodes =
         mergeNodes(ChainPred->Nodes, ChainSucc->Nodes, MergeOffset, MergeType);
 
     // Do not allow a merge that does not preserve the original entry point.
     if ((ChainPred->isEntry() || ChainSucc->isEntry()) &&
-        !MergedBlocks.getFirstNode()->isEntry())
+        !MergedNodes.getFirstNode()->isEntry())
       return MergeGainT();
 
     // The gain for the new chain.
-    auto NewGainScore = extTSPScore(MergedBlocks, Jumps) - ChainPred->Score;
-    return MergeGainT(NewGainScore, MergeOffset, MergeType);
+    double NewScore = extTSPScore(MergedNodes, Jumps);
+    double CurScore = ChainPred->Score;
+    return MergeGainT(NewScore - CurScore, MergeOffset, MergeType);
   }
 
   /// Merge chain From into chain Into, update the list of active chains,
@@ -897,7 +938,7 @@ class ExtTSPImpl {
     assert(Into != From && "a chain cannot be merged with itself");
 
     // Merge the nodes.
-    MergedChain MergedNodes =
+    MergedVector<NodeT> MergedNodes =
         mergeNodes(Into->Nodes, From->Nodes, MergeOffset, MergeType);
     Into->merge(From, MergedNodes.getNodes());
 
@@ -908,8 +949,10 @@ class ExtTSPImpl {
     // Update cached ext-tsp score for the new chain.
     ChainEdge *SelfEdge = Into->getEdge(Into);
     if (SelfEdge != nullptr) {
-      MergedNodes = MergedChain(Into->Nodes.begin(), Into->Nodes.end());
-      Into->Score = extTSPScore(MergedNodes, SelfEdge->jumps());
+      MergedNodes = MergedVector<NodeT>(Into->Nodes.begin(), Into->Nodes.end());
+      MergedVector<JumpT> MergedJumps(SelfEdge->jumps().begin(),
+                                      SelfEdge->jumps().end());
+      Into->Score = extTSPScore(MergedNodes, MergedJumps);
     }
 
     // Remove the chain from the list of active chains.
@@ -1255,7 +1298,7 @@ class CDSortImpl {
   }
 
   /// Compute the change of the distance locality after merging the chains.
-  double distBasedLocalityGain(const MergedChain &MergedBlocks,
+  double distBasedLocalityGain(const MergedVector<NodeT> &MergedBlocks,
                                const std::vector<JumpT *> &Jumps) const {
     if (Jumps.empty())
       return 0.0;
@@ -1283,7 +1326,7 @@ class CDSortImpl {
     assert(Into != From && "a chain cannot be merged with itself");
 
     // Merge the nodes.
-    MergedChain MergedNodes =
+    MergedVector<NodeT> MergedNodes =
         mergeNodes(Into->Nodes, From->Nodes, MergeOffset, MergeType);
     Into->merge(From, MergedNodes.getNodes());
 
diff --git a/llvm/test/CodeGen/X86/code_placement_ext_tsp.ll b/llvm/test/CodeGen/X86/code_placement_ext_tsp.ll
index 4053b8a8e123b1c..be0b9820e145415 100644
--- a/llvm/test/CodeGen/X86/code_placement_ext_tsp.ll
+++ b/llvm/test/CodeGen/X86/code_placement_ext_tsp.ll
@@ -1,6 +1,5 @@
 ;; See also llvm/unittests/Transforms/Utils/CodeLayoutTest.cpp
 ; RUN: llc -mcpu=corei7 -mtriple=x86_64-linux -enable-ext-tsp-block-placement=1 < %s | FileCheck %s
-; RUN: llc -mcpu=corei7 -mtriple=x86_64-linux -enable-ext-tsp-block-placement=1 -ext-tsp-chain-split-threshold=0 -ext-tsp-enable-chain-split-along-jumps=0 < %s | FileCheck %s -check-prefix=CHECK2
 
 define void @func1a()  {
 ; Test that the algorithm positions the most likely successor first
@@ -329,8 +328,8 @@ end:
 }
 
 define void @func4() !prof !11 {
-; Test verifying that, if enabled, chains can be split in order to improve the
-; objective (by creating more fallthroughs)
+; Test verifying that chains can be split in order to improve the objective
+; by cr...
[truncated]

david-xl commented Oct 9, 2023

Curious: why is benchmark 3's runtime increasing (marked as benchmark 2 in Before)?

spupyrev commented Oct 9, 2023

Curious: why is benchmark 3's runtime increasing (marked as benchmark 2 in Before)?

I wasn't particularly careful with measuring build times here, as these are seconds, while the overall compilation takes minutes/hours. We should treat the "reordering runtime" here as given with +/- 10% precision, so the difference for "benchmark 3" (sorry for the typo) is likely just noise. In my mind, the times for benchmarks 1, 2, and 3 are unchanged, while there is a small but measurable (~1 sec) improvement for benchmark 4, and a large improvement for benchmark 5.
Let me know if you'd like to get higher precision.

// Do not split the chain along a jump.
const NodeT *BB = ChainPred->Nodes[Offset - 1];
const NodeT *BB2 = ChainPred->Nodes[Offset];
if (BB->isSuccessor(BB2))
Contributor:
What's the rationale for this? Sometimes a loop is rotated to improve overall ext-tsp score even though the chain is split along a jump.

Contributor Author:
This is purely for speeding up the algorithm. You're right that it results in (slightly) worse solutions, but we get a significant speedup.
I expanded the comment here: we can still rotate loops via one of the loops above; this check essentially prevents creating solutions with fewer (unweighted) fallthroughs.
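
For clarity, a self-contained restatement of the guard being discussed (the
Block type and helper below are illustrative only; the actual code in
CodeLayout.cpp operates on NodeT/ChainT):

  #include <cstddef>
  #include <vector>

  struct Block {
    std::vector<const Block *> Succs; // outgoing jump targets

    // Check whether Other is a jump target of this block.
    bool isSuccessor(const Block *Other) const {
      for (const Block *S : Succs)
        if (S == Other)
          return true;
      return false;
    }
  };

  // When enumerating split offsets of a chain, skip offsets that would
  // separate a block from one of its successors: splitting there breaks an
  // existing (unweighted) fallthrough, and skipping such offsets is what
  // buys the speedup.
  std::vector<std::size_t>
  candidateSplitOffsets(const std::vector<Block *> &Chain) {
    std::vector<std::size_t> Offsets;
    for (std::size_t Offset = 1; Offset < Chain.size(); ++Offset) {
      if (Chain[Offset - 1]->isSuccessor(Chain[Offset]))
        continue;
      Offsets.push_back(Offset);
    }
    return Offsets;
  }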

while (HotChains.size() > 1) {
ChainT *BestChainPred = nullptr;
ChainT *BestChainSucc = nullptr;
MergeGainT BestGain;
// Iterate over all pairs of chains.
for (ChainT *ChainPred : HotChains) {
// Since the score of merging doesn't increase, we can stop early when
Contributor:
Is this always true? What if merging two chains A and B allows the total gain of merging <A.B> with another chain C to increase because both A and B have branches to C?

Contributor Author:
Again, this is for speeding up the algorithm. The empirical speedup seems to be worth the quality regression.
Adjusted the comment.

@spupyrev spupyrev requested review from JDevlieghere, nikic and a team as code owners October 13, 2023 00:23
@spupyrev spupyrev marked this pull request as draft October 13, 2023 00:24
@spupyrev spupyrev closed this Oct 13, 2023
@spupyrev spupyrev removed request for a team, nikic and JDevlieghere October 13, 2023 00:26
@spupyrev spupyrev reopened this Oct 13, 2023
github-actions bot commented Oct 13, 2023

✅ With the latest revision this PR passed the Python code formatter.

github-actions bot commented Oct 13, 2023

✅ With the latest revision this PR passed the C/C++ code formatter.

@spupyrev spupyrev marked this pull request as ready for review October 13, 2023 00:47
spupyrev commented Oct 13, 2023

After more thorough testing, I had to bring back ChainSplitThreshold and modify the tests accordingly, as otherwise some tests were up to 4x slower. Having this threshold regresses the quality a little bit, but I don't see a proper way of removing it.
I adjusted the numbers on my benchmarks to reflect the change.

@@ -725,6 +734,7 @@ class ExtTSPImpl {
return std::make_tuple(A1->Id, B1->Id) < std::make_tuple(A2->Id, B2->Id);
};

double PrevScore = 1e9;
Member:
This should use a sentinel value instead of a magic 1e9.

You can use std::optional<double> for an absent value.
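
A minimal sketch of that suggestion (the struct and member names below are
hypothetical; as the reply notes, the variable was ultimately removed from the
patch):

  #include <optional>

  // Track the score of the previously applied merge without a 1e9 sentinel;
  // an empty optional means no merge has been applied yet.
  struct MergeLoopState {
    std::optional<double> PrevScore;

    // Mirrors the early-exit in the patch, which assumes that merge scores
    // do not increase across iterations.
    bool canStopEarly(double BestScore) const {
      return PrevScore && BestScore == *PrevScore;
    }

    void recordMerge(double BestScore) { PrevScore = BestScore; }
  };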

Contributor Author:
removed the variable, as it seems to confuse people while the benefit is modest

@@ -1,6 +1,5 @@
;; See also llvm/unittests/Transforms/Utils/CodeLayoutTest.cpp
; RUN: llc -mcpu=corei7 -mtriple=x86_64-linux -enable-ext-tsp-block-placement=1 < %s | FileCheck %s
; RUN: llc -mcpu=corei7 -mtriple=x86_64-linux -enable-ext-tsp-block-placement=1 -ext-tsp-chain-split-threshold=0 -ext-tsp-enable-chain-split-along-jumps=0 < %s | FileCheck %s -check-prefix=CHECK2

Member:
Why is the -ext-tsp-chain-split-threshold test removed?

Contributor Author:
The test relies on another option, ext-tsp-enable-chain-split-along-jumps, which is now deleted.

@@ -753,15 +774,23 @@ class ExtTSPImpl {
BestGain = CurGain;
BestChainPred = ChainPred;
BestChainSucc = ChainSucc;
// Stop early when the merge is as good as the previous one.
@MaskRay (Member) Oct 13, 2023:
If PrevScore is changed to a std::optional, add an assert(!PrevScore || BestGain.score() <= *PrevScore + EPS)

@spupyrev

Are there any more comments or suggestions? Thanks

// Stop early if the combined chain violates the maximum allowed size.
if (ChainPred->numBlocks() + ChainSucc->numBlocks() >= MaxChainSize)
continue;
// Don't merge the chains if they have vastly different densities.
// We stop early if the ratio between the densities exceeds
Contributor:
Skip rather than stop?

// We stop early if the ratio between the densities exceeds
// MaxMergeDensityRatio. Smaller values of the option result in
// fewer merges (hence, more chains), which in turn typically yields
// smaller size of the hot code section.
Contributor:
I am confused by this comment. How can the hot section get smaller? Yes, we don't merge the chains, but they still get concatenated at the end.

Contributor Author:
I adjusted the comment (and the one above). Please check if the new one makes sense. Thanks

@@ -741,12 +750,23 @@ class ExtTSPImpl {
// Get candidates for merging with the current chain.
for (const auto &[ChainSucc, Edge] : ChainPred->Edges) {
// Ignore loop edges.
if (ChainPred == ChainSucc)
if (Edge->isSelfEdge())
Member:
Nit: This is often called a self-loop or loop. If you think it's fine, I'll push an NFC commit renaming isSelfEdge to isSelfLoop...

@spupyrev spupyrev merged commit cc2fbc6 into llvm:main Oct 25, 2023
@spupyrev spupyrev deleted the exttsp-faster-v2 branch November 16, 2023 18:50